DAOS-16979 control: Reduce frequency of hugepage allocation at runtime #15848
base: master
Conversation
Test-tag-hw-medium: pr daily_regression
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
Ticket title is 'Mitigation against hugepage memory fragmentation'
@phender, as discussed, in order to try to reproduce the DMA grow failure (DAOS-16979) related to hugepage fragmentation, I ran this PR with …
…gemem-no-fragment
Signed-off-by: Tom Nabarro <[email protected]>
Test-tag: pr daily_regression
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum number of hugepages at startup, regardless of the number of bdevs in the configuration.
If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.
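To illustrate that alternative, here is a hypothetical harness-side sketch (not part of this PR, and not DAOS code): it grows the kernel's default-size hugepage pool to a chosen maximum once, before the first daos_server start, so later reconfigurations never need to change it. The helper name, path handling, and page count are assumptions for illustration.

```go
// Hypothetical harness-side helper (not part of this PR): reserve the
// maximum hugepage count once, before the first daos_server start, so
// later reconfigurations never need to grow the pool at runtime.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Kernel knob for the default hugepage size; writing it requires root.
const nrHugepagesPath = "/proc/sys/vm/nr_hugepages"

// reserveHugepages grows the system pool to at least 'want' pages and
// never shrinks it, matching the "allocate once" idea.
func reserveHugepages(want int) error {
	raw, err := os.ReadFile(nrHugepagesPath)
	if err != nil {
		return err
	}
	have, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return err
	}
	if have >= want {
		return nil // already enough; avoid touching the pool again
	}
	return os.WriteFile(nrHugepagesPath, []byte(strconv.Itoa(want)), 0o644)
}

func main() {
	// 16 targets * 2 engines * 512 pages per target is illustrative only.
	if err := reserveHugepages(16 * 2 * 512); err != nil {
		fmt.Fprintln(os.Stderr, "hugepage reservation failed:", err)
		os.Exit(1)
	}
}
```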
minHugepages, maxHugepages, cfgTargetCount, largeTargetCount, msgSysXS)

if minHugepages > maxHugepages {
	log.Debugf("config hugepage requirements exceed normal maximum")
Should this be logged at NOTICE level? Who is it for?
This is just a debug message because the user doesn't need to do anything about it; it is an indication that the configuration requires more hugepages than the normal maximum.
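For context, a hedged sketch of how the two values in this hunk could relate, with minHugepages as the config-derived requirement and maxHugepages as the hard-coded large allocation. The function and the fallback behaviour below are illustrative assumptions, not the PR's exact code.

```go
package main

import "fmt"

// Illustrative only: if the config needs more than the normal maximum
// (the case the debug message above refers to), honour the config;
// otherwise request the fixed large amount so the pool is grown once
// and reused across restarts.
func hugepagesToRequest(minHugepages, maxHugepages int) int {
	if minHugepages > maxHugepages {
		return minHugepages
	}
	return maxHugepages
}

func main() {
	fmt.Println(hugepagesToRequest(4096, 16384))  // typical config: use the large maximum
	fmt.Println(hugepagesToRequest(20480, 16384)) // oversized config: exceed the maximum
}
```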
	return FaultConfigHugepagesDisabledWithBdevs
}
if minHugepages != 0 {
	log.Noticef("hugepages disabled but targets will be assigned to bdevs, " +
I don't disagree with doing something here, but logging "caution is advised" is not particularly helpful, IMO. Is it an error or not? What is the admin supposed to do if/when they happen to notice this message in the server log?
This is to indicate that the server is operating in an unusual mode; the administrator should be aware of that.
I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1420/log
FWIW I've seen similar on Aurora after a fresh reboot:
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1565/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1518/log
I don't doubt that there is a problem... My concern is more that it seems like the actual problem is not yet understood, and the proposed approach in this PR is a potential solution for a very specific set of scenarios. Adding a hard-coded configuration for hugepages kind of defeats the purpose of having a configuration mechanism, and it seems likely to cause unintended problems for configurations that are outside of what's being hard-coded in this PR.
Yes, there could be other unknown issues to be solved (as @daltonbohning mentioned, the allocation failure was seen after a fresh reboot, when memory isn't supposed to be fragmented), but allocating hugepages at runtime (setting nr_hugepages) is believed to be a likely source of fragmentation. I think our goal is to avoid allocating hugepages at runtime when possible, for both production and test systems.
Test-tag: pr daily_regression
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
	return nil
} else if minHugepages == 0 {
	// Enable minimum needed for scanning NVMe on host in discovery mode.
	if cfg.NrHugepages < scanMinHugepageCount && mi.HugepagesTotal < scanMinHugepageCount {
I am probably missing something, but I was thinking that mi.HugepagesTotal was the number of available hugepages, and thus cfg.NrHugepages could not be greater than this first value.
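For reference, a HugepagesTotal-style value is typically read from the HugePages_Total field in /proc/meminfo, i.e. the pool that is currently allocated rather than a hard upper bound, so a config-requested count can exceed it until the server grows the pool. A minimal sketch of that kind of lookup follows; the parsing below is illustrative, not DAOS's actual MemInfo code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// hugepagesTotal returns the HugePages_Total value from /proc/meminfo,
// i.e. the number of hugepages currently allocated in the kernel pool.
func hugepagesTotal() (int, error) {
	raw, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(raw), "\n") {
		if strings.HasPrefix(line, "HugePages_Total:") {
			val := strings.TrimSpace(strings.TrimPrefix(line, "HugePages_Total:"))
			return strconv.Atoi(val)
		}
	}
	return 0, fmt.Errorf("HugePages_Total not found in /proc/meminfo")
}

func main() {
	total, err := hugepagesTotal()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("currently allocated hugepages:", total)
}
```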
	// allocate on numa node 0 (for example if a bigger number of hugepages are
	// required in discovery mode for an unusually large number of SSDs).
	prepReq.HugepageCount = srv.cfg.NrHugepages
	srv.log.Debugf("skip allocating hugepages, no change is required")
Should it not be an error to have bdevs without hugepages?
After chatting with @NiuYawei, we decided that we should support emulated NVMe with or without hugepages, as some usage models may not require them.
Makes sense.
NIT: the error FaultConfigHugepagesDisabledWithBdevs raised at line 568 should probably be renamed to something such as FaultConfigHugepagesDisabledWithNvmeBdevs.
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1220/log
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1429/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1522/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1569/log
Reduce the frequency of hugepage allocation change requests made to
the kernel during daos_server start-up. Check the total hugepages on start
and only request more from the kernel if the recommended number, calculated
from the server config file content, is greater than the existing system total.
On the first start of the server process after a reboot, allocate an arbitrarily
large number, e.g. enough for 16*2 engine targets, the ambition being
to allocate once and reduce the chance of fragmentation.
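A minimal sketch of that start-up decision follows; the names, the per-target page count, and the fallback for oversized configs are assumptions for illustration, not the PR's exact code.

```go
package main

import "fmt"

const (
	pagesPerTarget   = 512    // assumed hugepages needed per engine target
	largeTargetCount = 16 * 2 // "enough for 16*2 engine targets"
)

// hugepagesToAllocate returns 0 when no runtime change is needed, otherwise
// the nr_hugepages value to request from the kernel.
func hugepagesToAllocate(cfgTargetCount, systemTotal int) int {
	recommended := largeTargetCount * pagesPerTarget
	if need := cfgTargetCount * pagesPerTarget; need > recommended {
		recommended = need // config needs more than the normal maximum
	}
	if recommended <= systemTotal {
		return 0 // enough already allocated; avoid touching the pool
	}
	return recommended
}

func main() {
	fmt.Println(hugepagesToAllocate(8, 0))     // first start after reboot: grow once (16384)
	fmt.Println(hugepagesToAllocate(8, 16384)) // later restart: no change requested (0)
}
```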
Test-tag: pr daily_regression
Allow-unstable-test: true
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: